Pre-deploy user-impact fixes: healthz, preview render isolation, cross-origin embeds, upload gate#826
Conversation
GAE serves / as a static file, so production can be fully down (every Express instance crash-looping) while / stays green. /healthz is an Express-routed, unauthenticated uptime-check target that reflects the WASM engine preload via isReady(). It mounts ahead of the request logger and session middleware so constant polling stays cheap and out of the logs. Because initializeServerDependencies() throws before the route mounts, a preload failure surfaces as a non-responding instance, not a 503; the 503 branch is defense-in-depth. Also make a rejected main() log and exit(1) instead of falling into the unhandledRejection logger, which left a zombie process that never bound the port; on GAE that hung instances until the port-bind timeout instead of recycling them with a clear error. This is the in-repo half of issue 693 (the Cloud Monitoring channel/uptime-check/alert setup remains ops-side).
…dline Preview PNG rendering ran the whole protobuf->SVG->PNG pipeline synchronously on the Express event loop through the process-wide WASM DirectBackend, so one slow or pathological model stalled every concurrently-multiplexed request on the instance (GAE allows 100) -- a regression from the 2022 per-request worker_threads design. Restore that isolation: each render runs in its own worker thread with its own WASM instance, behind a 2-slot FIFO limiter, under a 10s total wall-clock budget that includes queue wait (a request whose deadline lapses while queued rejects without spawning a doomed worker). A WASM panic or OOM now kills a disposable thread, not the server. Failures stay per-request: the preview route's existing 500 path is unchanged. Cap automatic_scaling at 8 instances in app.yaml so a render storm or crash loop cannot fan out cost without bound, and make validate-app-prod-config.mjs refuse to deploy if the operator-local .app.prod.yaml lacks max_instances. verify-deploy-build.sh and the staged-deploy verify step now assert lib/render-worker.js ships, because a build that dropped it would 500 every preview while looking healthy. Addresses the isolation, timeout, and instance-cap facets of issue 694; a model-complexity cap below the 10 MB body limit remains open there.
…nd /static CORS The embeddable web component was broken on third-party pages: the engine runs in a Web Worker whose chunk URL resolves (via the component build's assetPrefix 'auto') to an absolute app.simlin.com URL, and the Worker constructor enforces same-origin regardless of CORS, so embeds loaded project data but never initialized the engine. Boot the worker through a same-origin blob: trampoline when the resolved chunk URL is cross-origin: worker-trampoline.ts holds the pure origin-decision and trampoline-source logic plus an injectable spawn shell that briefly intercepts the global Worker constructor to observe the URL the bundler resolved (the new Worker(new URL(...)) expression must stay inline or rspack stops bundling the worker chunk). The imported chunk keeps its https URL identity, so its internal WASM fetch resolves against app.simlin.com; engine-worker.ts overrides the runtime publicPath in the trampolined (classic, blob-origin) context where self.location would otherwise poison it. The same-origin path -- the main app -- constructs through the native Worker with identical arguments, verified against a rebuilt bundle. Serve Access-Control-Allow-Origin: * on the /static GAE handler so the trampoline's cross-origin import and the worker's WASM fetch pass CORS; everything under /static is public build output. validate-app-prod-config.mjs now requires the header in the operator-local .app.prod.yaml, and the deploy smoke list gains a header curl plus a real two-origin embed check (the remaining browser-only semantics are uncoverable by unit tests; issue 825 tracks an automated Playwright harness).
gcloud app deploy uploads whatever .gcloudignore leaves in, entirely independent of git tracking status, so a git-status-clean tree can still trip App Engine Standard's hard 10,000-file cap on gitignored artifacts (cargo target dirs, venvs, scratch checkouts) -- and the failure only surfaced inside gcloud, after the expensive clean+build. check-upload-file-count.sh enumerates the real upload set once via gcloud meta list-files-for-upload immediately before each deploy and aborts with a per-top-level-directory breakdown when the cap would be hit. Both deploy paths run it: the root-path deploy because that is where the unbounded exposure lives, the staged path as a cheap guard against staging-assembly regressions. The cap parameter is overridable only so the failure path can be exercised by hand against a tiny fixture. With the .gcloudignore hardening that landed earlier, this completes issue 695.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #826 +/- ##
==========================================
- Coverage 90.87% 90.87% -0.01%
==========================================
Files 224 224
Lines 143133 143133
==========================================
- Hits 130073 130071 -2
- Misses 13060 13062 +2 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b9586a71ce
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
| settled = true; | ||
| clearTimeout(timer); | ||
| void worker.terminate(); |
There was a problem hiding this comment.
Await worker termination before releasing render slots
When a preview times out or a worker reports an error under a burst of render requests, this ignores the Promise returned by worker.terminate() and settles the task immediately, so the limiter releases its slot before the old thread has actually emitted exit. Because Node worker termination is asynchronous, queued requests can start new workers while timed-out CPU/WASM workers are still shutting down, defeating the intended MAX_CONCURRENT_RENDERS cap and reintroducing the worker fan-out/memory pressure this change is trying to bound.
Useful? React with 👍 / 👎.
ReviewLooked at the four fix bundles carefully — code, tests, and cross-file interactions. No blocking issues found. Areas I verified as correct:
Overall correctness verdict: correct |
Fixes #688
Fixes #694
Fixes #695
Triage pass over epic #733 focused on issues most likely to impact users of the redeployed App Engine app, skipping compat corner cases (MDL/period-ident items, local-tool issues). Each fix was implemented and then adversarially reviewed by an independent pass; all review findings were resolved before commit.
What's here (one commit per fix)
/healthzroute + exit-on-failed-boot (in-repo half of #693). GAE serves/as a static file, so production can be completely down while/stays green. The new unauthenticated Express route is the uptime-check target; it mounts ahead of the request logger and session middleware. A rejectedmain()now logs and exits non-zero instead of leaving a zombie process that never binds the port. #693 stays open for the ops-side Cloud Monitoring setup (channel, uptime checks, alert policies).Preview render isolation (#694). Server-side PNG previews ran the whole protobuf-to-SVG-to-PNG pipeline synchronously on the Express event loop through the shared WASM backend -- one pathological model stalled every request on the instance (GAE multiplexes 100). Restored the 2022-style worker_threads isolation: per-request worker with its own WASM instance, a 2-slot FIFO limiter, and a 10s total wall-clock budget that includes queue wait (a request whose deadline lapses while queued rejects without spawning a doomed worker).
app.yamlgainsautomatic_scaling.max_instances: 8as a cost cap, the config validator refuses to deploy if.app.prod.yamllacks it, and the build-verification scripts assertlib/render-worker.jsships.Cross-origin embeds (#688). The embeddable web component was broken on third-party pages: the engine worker chunk resolves to an absolute app.simlin.com URL and the Worker constructor enforces same-origin regardless of CORS. The engine now boots the worker through a same-origin blob trampoline when the resolved chunk URL is cross-origin, and the
/staticGAE handler servesAccess-Control-Allow-Origin: *so the trampoline's import and the worker's WASM fetch pass CORS. The same-origin main-app path constructs through the native Worker with identical arguments (verified against a rebuilt bundle during review). The remaining browser-only semantics are covered by a new mandatory two-origin manual smoke check in deploy.md; #825 tracks an automated Playwright harness.Upload-count gate (#695).
gcloud app deployuploads whatever.gcloudignoreleaves in, independent of git status; a clean tree could still trip GAE's hard 10,000-file cap after the expensive clean+build. Both deploy scripts now enumerate the real upload set immediately before deploying and abort with a per-directory breakdown. Together with the.gcloudignorehardening from #809, this completes #695.Also done alongside
.app.prod.yamlmirrorsmax_instancesand the/staticCORS header -- deliberate, documented in the pre-deploy checklist.